Introduction

Text mining and analytics were conducted on a dataset of scraped Reliefweb.int (RW) articles on Ukraine. Articles were limited to the year 2022 and the English language. A total of 3,895 documents were scraped and tokenised, and all stop words (such as the, a, we, can) were removed.
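As a rough illustration of the tokenising and stop-word step, here is a minimal Python sketch. The stop-word list and the tokenising rule are assumptions made for the example; the list actually applied to the corpus is not reproduced here.

```python
import re

# Illustrative stop-word list only (an assumption, not the list used for the corpus).
STOP_WORDS = {"the", "a", "we", "can", "of", "in", "and", "to", "is"}

def tokenise(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into word tokens and drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens if t not in stop_words]

print(tokenise("We can see the humanitarian response in Ukraine"))
```

Everything downstream (word pairs, bigrams, tf-idf, log-odds) starts from token lists of this kind.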

A brief examination of the most common word pairs in the titles of RW articles on Ukraine yields this network graph. Only the more common word pairs have been included, and the thickness of the line between two words indicates the number of times that pair appears in the corpus.
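The edge weights of such a graph can be sketched as follows. This assumes that a "word pair" means two adjacent words once stop words are removed, and the titles and the `min_count` threshold are made up for the example.

```python
from collections import Counter

def count_word_pairs(token_lists, min_count=2):
    """Count adjacent word pairs across documents and keep only the common ones."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(zip(tokens, tokens[1:]))
    return {pair: n for pair, n in counts.items() if n >= min_count}

# Toy titles, already tokenised (hypothetical data).
titles = [
    ["ukraine", "situation", "report"],
    ["ukraine", "situation", "report", "february"],
    ["ukraine", "humanitarian", "appeal"],
]
edges = count_word_pairs(titles, min_count=2)
for (first, second), weight in sorted(edges.items()):
    print(first, second, weight)
```

Each surviving pair becomes an edge between two words, and its count sets the thickness of the line.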


title network graph Full-sized graph


This is all very much expected. Titles only give us the merest of glimpses into the response and provide us with no further knowledge beyond what could be obtained from casually watching the news. Whilst Ukraine is obviously central to the corpus, as are war and situation report, we see smaller but quite meaningful clusters. These include fiscal and fy as well as snapshot, funding and appeal.

However, we see that most of the corpus deals with situation updates or press releases (see the centrality of media and government). Relatively limited is information on achievements – estimated, reached, assistance, cash and relief do not form a large proportion of the titles.

If we take a look at the most common word pairs in the text of the scraped articles, we do not get much additional detail. As with the graph above, the thickness of the line indicates the number of co-occurrences.


word pair network graph Full-sized graph


If I knew nothing about the situation in Ukraine, I could glean from this graph that there is a war and a humanitarian response to it. I see multiple sectors being mentioned, as well as refugees. The word million shows up, as does scale.

I suppose this would work as a primer in other emergencies, but we will need a way to sort through the boilerplate (we will get to this later).

Let us, finally, take a macro-view of the dataset and plot, in a bit more detail, the correlations between word pairs (bigrams) within the corpus.

This network graph is not only much more complex, but it is also formed of bigrams, as this tends to improve interpretability at the cost of sensitivity. The main patterns in the RW corpus are now visible. This is the lay of the land, so to speak.


network graph kk full Download full-sized graph


Any bigram that appears here has occurred at least 50 times in the corpus and has a correlation of at least 0.15 with at least one other bigram (that is, the two tend to be found in the same documents).
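To make the two thresholds concrete, here is a small Python sketch of the filtering. The correlation measure is not named above, so the sketch assumes the phi coefficient computed on whether each bigram is present or absent in each document. The function names and the toy documents are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def phi(docs_a, docs_b, n_docs):
    """Phi coefficient between two terms, given the sets of documents containing each."""
    n11 = len(docs_a & docs_b)        # documents containing both terms
    n10 = len(docs_a) - n11           # documents containing only the first
    n01 = len(docs_b) - n11           # documents containing only the second
    n00 = n_docs - n11 - n10 - n01    # documents containing neither
    denom = math.sqrt(
        len(docs_a) * len(docs_b) * (n_docs - len(docs_a)) * (n_docs - len(docs_b))
    )
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0

def correlated_bigrams(doc_bigrams, min_count=50, min_cor=0.15):
    """Keep bigram pairs that pass a frequency threshold and a correlation threshold."""
    totals = Counter(b for doc in doc_bigrams for b in doc)
    containing = {}
    for i, doc in enumerate(doc_bigrams):
        for b in set(doc):
            if totals[b] >= min_count:
                containing.setdefault(b, set()).add(i)
    pairs = {}
    for a, b in combinations(sorted(containing), 2):
        r = phi(containing[a], containing[b], len(doc_bigrams))
        if r >= min_cor:
            pairs[(a, b)] = r
    return pairs

# Toy corpus: one list of bigrams per document (hypothetical data).
docs = [
    ["war crimes", "human rights"],
    ["war crimes", "human rights"],
    ["food assistance"],
    ["food assistance", "human rights"],
]
print(correlated_bigrams(docs, min_count=2, min_cor=0.15))
```

Under this assumption, the correlation reflects how much more often two bigrams share a document than they would by chance, so it is not a raw share of documents.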

Immediately, we note that large proportions of RW are tied to conflict reporting (coming largely from OSCE and ACLED) as well as numerous releases from the IAEA. Both are important, but it would be far easier to analyse them separately. In the case of ACLED, it is much better to analyse the content of each conflict event’s description than the general weekly ACLED report that RW has been uploading. The issue there is that words like Kazakhstan or Tibet come up because ACLED weekly reports cover all countries.

We can now see the full spectrum of rhetoric on Ukraine – respect international, humanitarian law, recognised borders, war crimes, and of course, human rights are front and centre. I wish these Europeans would be as concerned about human rights in other parts of the world as well.

But we also start to see certain keywords of high importance to persons in the industry, frequently appearing in the news: black sea grain initiative, humanitarian corridors and gender based violence.




Months

So, now that we can see the broad strokes of the corpus, how do we extract a bit more value from this dataset? One easy win is to look at this emergency through the lens of time – it is a hot conflict, after all.

Let’s start by looking at a simple GIF of conflict events across the whole of 2022, based on the ACLED dataset:


event type gif Full-sized GIF


So we can see that there are clear patterns to the violence. Milbloggers definitely have much better analysis there. But, applying this to the RW corpus, let us use log-odds to separate out bigrams by month. The logarithm of the odds that a word appears in a given month is the metric we use to determine which words are most distinctive of each month.
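For readers who want to see the metric spelled out, here is a minimal Python sketch. It assumes a plain smoothed log-odds ratio (the odds of a bigram inside one month against its odds in all other months), which may differ from the exact variant behind the plot. The month labels and counts are made up.

```python
import math
from collections import Counter

def log_odds_by_group(group_counts, smoothing=0.5):
    """group_counts: {group: Counter of bigram counts} -> {group: {bigram: log-odds ratio}}."""
    overall = Counter()
    for counts in group_counts.values():
        overall.update(counts)
    grand_total = sum(overall.values())

    result = {}
    for group, counts in group_counts.items():
        total = sum(counts.values())
        rest_total = grand_total - total
        scores = {}
        for term, c in counts.items():
            c_rest = overall[term] - c
            # The smoothing constant avoids division by zero and log(0).
            odds_in = (c + smoothing) / (total - c + smoothing)
            odds_out = (c_rest + smoothing) / (rest_total - c_rest + smoothing)
            scores[term] = math.log(odds_in / odds_out)
        result[group] = scores
    return result

# Toy counts per month (hypothetical data).
months = {
    "month_a": Counter({"humanitarian corridors": 8, "humanitarian assistance": 12}),
    "month_b": Counter({"grain initiative": 9, "humanitarian assistance": 11}),
}
scores = log_odds_by_group(months)
for month, terms in scores.items():
    print(month, max(terms, key=terms.get))
```

A bigram that appears in every month at a similar rate scores close to zero, while one concentrated in a single month scores high for that month.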

This is the first product we have developed that is easily interpretable and adds significantly to our understanding of the crisis.


bigram months Full-sized plot




Sources

As demonstrated by the tightly-knit clusters of bigrams surrounding IAEA, ACLED and OSCE submissions, the source matters a lot. Furthermore, we would also dearly like to believe that each of the agencies we work for has its own specialities and particularities that make it our preferred actor.



Tf, tf-idf and log-odds

We will explore what each source has written by looking at:

  • Term frequency
  • Term frequency-inverse document frequency
  • Log-odds

Term frequency here shows us the phrases that each organisation uses most commonly. Often, their name or their mandate (“food assistance”) will appear in this list.

Tf-idf (term frequency-inverse document frequency) is another measure by which significant words are identified. The term frequency – the number of times a term appears in the corpus – is tempered by the inverse document frequency, which discounts words that tend to appear again and again (think “humanitarian assistance” or “including women”). The combined metric is useful for determining which words are common, but not too common. A suspicious-minded person might even say that this is where we may find what agencies really care about, as opposed to the boilerplate that they flood every report with.
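A small worked example may help here. The sketch below treats all the text from one source as a single "document", which is an assumption, and uses the common formulation tf × ln(N / df). The sources and counts are made up.

```python
import math
from collections import Counter

def tf_idf(source_counts):
    """source_counts: {source: Counter of bigram counts} -> {source: {bigram: tf-idf}}."""
    n_sources = len(source_counts)
    # Document frequency: in how many sources does each bigram appear at least once?
    doc_freq = Counter()
    for counts in source_counts.values():
        doc_freq.update(counts.keys())

    scores = {}
    for source, counts in source_counts.items():
        total = sum(counts.values())
        scores[source] = {
            term: (c / total) * math.log(n_sources / doc_freq[term])
            for term, c in counts.items()
        }
    return scores

# Toy counts per source (hypothetical data).
sources = {
    "source_a": Counter({"food assistance": 6, "humanitarian assistance": 4}),
    "source_b": Counter({"access constraints": 7, "humanitarian assistance": 3}),
}
weights = tf_idf(sources)
print(weights["source_a"])
```

In this toy case "humanitarian assistance" appears in every source, so its idf is ln(1) = 0 and it drops out, which is exactly the discounting of boilerplate described above.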

Finally, we also evaluate words within a corpus by looking at their log-odds. What this means, for this section, is that the words appearing under the log-odds column are more likely to originate from that source than from any other source. This is where we see what unique information each source brings to the table.



Bigram plots by source

A series of plots has been generated for each source with more than 5 contributions to RW in 2022 about Ukraine. Each plot shows the top bigrams by term frequency, tf-idf and log-odds.
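The selection of sources and of their top bigrams by term frequency could be sketched like this. The function name, the `top_n` value and the toy documents are hypothetical; only the "more than 5 contributions" rule comes from the text above.

```python
from collections import Counter, defaultdict

def top_bigrams_by_source(documents, more_than=5, top_n=3):
    """documents: iterable of (source, list_of_bigrams) pairs, one per RW document."""
    doc_counts = Counter()
    bigram_counts = defaultdict(Counter)
    for source, bigrams in documents:
        doc_counts[source] += 1
        bigram_counts[source].update(bigrams)
    # Keep only sources with more than `more_than` documents.
    return {
        source: [b for b, _ in bigram_counts[source].most_common(top_n)]
        for source, n in doc_counts.items()
        if n > more_than
    }

# Toy corpus: six documents from one source, two from another (hypothetical data).
documents = (
    [("source_a", ["access constraints", "access constraints", "ground conflict"])] * 6
    + [("source_b", ["press release"])] * 2
)
print(top_bigrams_by_source(documents, more_than=5, top_n=1))
```

The same per-source counts can then be fed into the tf-idf and log-odds calculations to fill the other two columns of each plot.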

The organisations below are just a small selection of all bigram plots by source. All the plots may be viewed here.

If you see a bigram you’re curious about, make use of the Bigram Search Helper: type the bigram, source or date into the relevant search box.



ACAPS

Let’s start with ACAPS, because it comes first alphabetically and most people would have read at least some of their material.

Overall, it seems that ACAPS is concerned foremost with access constraints, with the top 3 of its tf-idf list being rounded out by active ground conflict. ACAPS also shows which sources it prizes highly, REACH and UNHCR, which likely influences the placement of building materials at the top of the bigrams with the highest log-odds. Strangely, this is not reflected in UNHCR’s own list of bigrams.



Plan International


Plan International is a good example of a responsible investment. Their log-odds column mentions a [multi-]sectoral response, as well as the statement [regardless of?] religion country age ability sexuality perceived differences, indicating a strong (or at least more vocal) commitment to human rights, though it could just be boilerplate (we’d have to check further down the term frequency column). Other language such as provide holistic and initial service indicates fair programmatic thinking. All this makes Plan International a strong candidate if I were looking for a full-service NGO.



Malteser International

Much less informative are the documents originating from Malteser International. It just seems like a lot of press releases mentioning senior staff. Whilst there would definitely be much more detail in a report, I would also not really prioritise reading any of their products. Look at Christian Aid for another negative example.


European Commission


Changing lanes a bit, the EC’s rhetoric seems to be very consistent. The acceptance of hryvnia banknotes and solidarity lanes are clearly important pieces of the response that the EC would like to highlight. And yes, these actions are laudable, but I also feel very irritated to have to ask why we can do this for white Europeans but not the Rohingya. We could have arranged for their gold to be sold at fair prices.

Otherwise, the EC cares about VET schools, vocational education and eu4skills protection. This also indicates a level of investment in human capital (after all, they might one day be EU citizens) that is absent in other parts of the world. Not to shame Europe, but it just goes to show how useless ASEAN is.


Govt. of Ukraine


I’d also question the “curation” that has occurred with the Govt. of Ukraine’s statements on RW. I’m not really sure that this is what the Ukrainian government would like to communicate to the humanitarian community, unless educational institutions are a cornerstone of the response, which funding indicates they are not. The focus on sexual violence is less puzzling, as the Ukrainian government might be signalling for humanitarian actors to move more fully into that space to complement state services.

Simply, we do not know and would need to ask someone there.



USAID


Perhaps one of the most interesting things to check would be how well a donor’s rhetoric aligns with their actions and their funding disbursements. Looking at the term frequencies, one could charitably interpret them as USAID’s main concerns (in order); everything below gorf attacks does paint the picture of a typical donor. Does USAID mentioning health facilities more than food assistance mean that WHO will be getting more money than WFP? We’ll check FTS.

Also unsurprisingly, USAID seems to be much more vocal about Russian attacks, which is the privilege of the world’s only superpower. There is also much importance accorded to a USAID-funded survey on societal attitudes in Ukraine. Did they pay for the survey to send a signal that they’re keen on social engineering?



ACLED


Finally, let’s take a look at ACLED. It should be quite apparent why I would want to analyse ACLED conflict event descriptions on their own: what has been uploaded to RW are just ACLED’s weekly reports, which cover Ukraine, but other things as well, such as presidential elections, the overturn of roe as well as mentions of germany police. This corpus is polluted. Furthermore, it provides irrelevant data to persons who search for ACLED on RW. Muddying the waters on one of the main sources of incident tracking is quite unforgivable.

See here for a bigram network graph based on ACLED conflict event descriptions, instead of weekly collated reports. Immediately, one can see that the level of detail and utility is markedly different and the ACLED dataset contains much richer textual data than what has been uploaded to RW.




Bigram search helper




Pairwise correlations between words